Introduction

Legend says that after the military victory against the Persians at the Battle of Marathon (490 BC), Greek soldier Pheidippides ran the 42.195 kilometres that separate Marathon and Athens to report the victory. What people seem to forget about this fable is that Pheidippides died immediately after delivering the message. And so long-distance running as we know was born…

This work analyses the 2017 Boston Marathon that took place on April 17th. Using data collected from 26,400 athletes, we will try to identify relationships between gender, age, ability, finishing times and consistency. We are going to use visualizations and statistics to support our claims.

Dataset

The dataset used for this study is available at https://www.kaggle.com/rojour/boston-results .

It consists of 26400 observations (athletes), each of which containing name, age, gender, country, city and state (where available), times at 9 different stages of the race, finish time and pace, overall place, gender place and division place.

Given the focus of our analysis, we’ll keep only information about the times, sex and gender of each athlete. The dataset is ordered by finishing time. Below can have a glimpse of the data of the first few runners to cross the line.

library(dplyr);
library(magrittr);
library(ggplot2);
library(lubridate);
library(readr);
df <- read.csv("./marathon_results_2017.csv", header=TRUE, stringsAsFactors=FALSE)
# Select only column
df <- df[c('Age', 'M.F', 'X5K', 'X10K', 'X15K', 'X20K', 'X25K', 'X30K', 'X35K', 'X40K', 'Official.Time')]
# Filter runners that had technical problems with recording apparatus
df %<>% filter(X5K != '-' & X10K != '-' & X15K != '-' & X20K != '-' & X25K != '-' & X30K != '-' & X35K != '-' & X40K != '-')
head(df)
print(paste("Number of runners: ", nrow(df)))
[1] "Number of runners:  26262"
print(paste("Number faulty observations: ", 26400 - nrow(df)))
[1] "Number faulty observations:  138"

As we can see, there are 8 columns that show the surpassed time at each 5K mark. Later we’ll do some pace variance analysis, so we’ll convert these times to be the time passed not from the start of the race, but from the mark of the previous 5K split.

This means that, for example, the X15K column will not tell how long it took the runner to run 15 kilometres, but how long it took the runner to run the last 5 kilomentres preceding the 15K mark, i.e., from 10K to 15K.

cols <- c('X5K', 'X10K', 'X15K', 'X20K', 'X25K', 'X30K', 'X35K', 'X40K')
df %<>% mutate_each_(funs(as.POSIXct(., format="%H:%M:%S")), cols);
df$X40K <- as.numeric(difftime(df$X40K, df$X35K, units='secs'))
df$X35K <- as.numeric(difftime(df$X35K, df$X30K, units='secs'))
df$X30K <- as.numeric(difftime(df$X30K, df$X25K, units='secs'))
df$X25K <- as.numeric(difftime(df$X25K, df$X20K, units='secs'))
df$X20K <- as.numeric(difftime(df$X20K, df$X15K, units='secs'))
df$X15K <- as.numeric(difftime(df$X15K, df$X10K, units='secs'))
df$X10K <- as.numeric(difftime(df$X10K, df$X5K, units='secs'))
df$X5K <- as.numeric(difftime(df$X5K, as.POSIXct('00:00:00', format="%H:%M:%S"), units='secs'))
colnames(df)[colnames(df) == 'M.F'] <- 'Gender'
head(df)

Now we can read the data frame above as: it took 925 seconds for runner #1 to run the first 5K of the race, 903s to run the next 5K, and so on.

But before we move to the analysis of the results, let’s take a look at the demographics of the population.

Inside a marathon there are many simultaneous events. There are men and women’s division, as well as 10 official divisons by age intervals. The main age group is the \([18, 39]\) year old, after that there are 9 age groups, but for sake of conciseness, we’ll divide athletes into 2 groups: young (belonging to the \([18, 39]\) age bracket), and 40+, for the remaining.

demo <- df %>%
  mutate(Gender, Gender = ifelse('M' == Gender,'MEN', 'WOMEN')) %>% 
  mutate(Age, Age = ifelse(Age >= 40, '40+', 'YOUNG')) %>% 
  group_by(Gender, Age) %>% 
  count()
demo$comb <- paste(demo$Age, demo$Gender)
# Pie Chart with Percentages
slices <- demo$n
lbls <- demo$comb
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels 
lbls <- paste(lbls,"%",sep="") # ad % to labels 
pie(slices, labels = lbls, col=c("blue", "cyan", "violet", "pink"), main="Distribution of gender and age")

Given the pie chart above, we can see that our population composed of 55% men and 45% women.

As for age, 44% of people are in the “young” range, whilst the remaining are 40 years or older.

The largest group is 40+ men and the smallest is young men, which is interesting.

How gender and age influence finish-times

The graph below shows compares finishing times between men and women. Because the number of male and female runners differ, the histograms have been normalized.

The plot tells us that men tend to be faster than women. This is not unexpected, as there are physiological reasons why men are capable of faster finish-times than women.

df$Official.Time <- as.POSIXct(df$Official.Time, format="%H:%M:%S")
ggplot(df, aes(df$Official.Time, fill = df$Gender)) +
  geom_histogram(aes(y=..density..), alpha=0.5, 
                 position="identity", lwd=0.2) +
  ggtitle("Normalized finish-times for men vs. women")

df %>%
  mutate(Age, Age = ifelse(Age > 40, '40+', 'YOUNG')) %>% 
  ggplot(aes(Official.Time, fill = Age)) +
    geom_histogram(aes(y=..density..), alpha=0.5, 
                 position="identity", lwd=0.2) +
  ggtitle("Normalized finish-times for young vs. 40+")

n_groups <- 20
zebra_colormap <- rep(c("darkcyan", "cyan"), 20)
df <- df[df$Official.Time < quantile(df$Official.Time, 0.99), ]
#splits = split(df, cut(df$Official.Time, N))  # Time splits
#splits <- split(df, rep(1:ceiling(nrow(df)/N), each=N, length.out=nrow(df)))  # N marathoners splits
#df$group <- rep(1:ceiling(nrow(df)/n_groups), each=nrow(df)/n_groups, length.out=nrow(df))  # N marathoners splits
df$Official.Time <- as.numeric(difftime(df$Official.Time, as.POSIXct('00:00:00', format="%H:%M:%S"), units='mins'))
df$group <- cut(df$Official.Time, n_groups)
ggplot(df) + 
  geom_point(aes(x=1:NROW(df), y=df$Official.Time,  col=as.factor(df$group))) +
  scale_color_manual(values=zebra_colormap) +
  theme_bw()

Are women more disciplined than men?

women <- df %>%
  filter(Gender == 'F')
men <- df %>%
  filter(Gender == 'M')
b_splits = split(df, df$group)  # Time splits
w_splits = split(women, women$group)  # Time splits
m_splits = split(men, men$group)  # Time splits
g_sd_df <- data.frame("group" = numeric(0),
                      "gender" = character(0),
                      "n" = numeric(0),
                      "mean_sd" = numeric(0),
                      "sd_sd" = numeric(0),
                      stringsAsFactors = FALSE)
for (i in 1:n_groups) {
  gender <- 'M'
  mean_sd <- as.numeric(m_splits[[i]] %>%
                    select(cols) %>%
                    transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
                    summarize(sample_sd = mean(SD, na.rm = TRUE), sd(SD, na.rm = TRUE)))
  n <- as.numeric(m_splits[[i]] %>%
                    select(Official.Time) %>%
                    summarize(n = n()))
  g_sd_df[nrow(g_sd_df) + 1,] = c(i, gender, n, mean_sd, sd_sd)
  
  
  gender <- 'F'
  mean_sd <- as.numeric(w_splits[[i]] %>%
                    select(cols) %>%
                    transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
                    summarize(sample_sd = mean(SD, na.rm = TRUE), sd(SD, na.rm = TRUE)))
  n <- as.numeric(w_splits[[i]] %>%
                    select(Official.Time) %>%
                    summarize(n = n()))
  if (n != 0) {
    g_sd_df[nrow(g_sd_df) + 1,] = c(i, gender, n, mean_sd, sd_sd)
  }
  gender <- 'B'
  mean_sd <- as.numeric(b_splits[[i]] %>%
                    select(cols) %>%
                    transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
                    summarize(sample_sd = mean(SD, na.rm = TRUE)))
  n <- as.numeric(b_splits[[i]] %>%
                    select(Official.Time) %>%
                    summarize(n = n()))
  g_sd_df[nrow(g_sd_df) + 1,] = c(i, gender, n, mean_sd, sd_sd)
  
}
data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]data length [6] is not a sub-multiple or multiple of the number of columns [5]
g_sd_df$n <- as.numeric(g_sd_df$n)
g_sd_df$mean_sd <- as.numeric(g_sd_df$mean_sd)
g_sd_df$sd_sd <- as.numeric(g_sd_df$sd_sd)
g_sd_df$group <- as.numeric(g_sd_df$group)
g_sd_df
ggplot(g_sd_df) +
  geom_point(aes(x=group, y=mean_sd, colour=gender)) +
  scale_color_manual(values=c('black', 'violet', 'blue')) +
  theme_bw()

https://www.youtube.com/watch?v=d7nLL6cUC0I

g_sd_df <- filter(g_sd_df, g_sd_df$gender == 'M' | g_sd_df$gender == 'F')
g_sd_df$min_error <- ((g_sd_df$n-1) * (g_sd_df$mean_sd * g_sd_df$mean_sd)) / qchisq(.025, df = g_sd_df$n - 1, lower.tail = FALSE)
g_sd_df$max_error <- ((g_sd_df$n-1) * (g_sd_df$mean_sd * g_sd_df$mean_sd)) / qchisq(.975, df = g_sd_df$n - 1, lower.tail = FALSE)
g_sd_df
ggplot(g_sd_df, aes(x=group, y=mean_sd, colour=gender)) +
  geom_errorbar(aes(ymin=min_error, ymax=max_error)) +
  scale_color_manual(values=c('violet', 'blue')) +
  theme_bw()

corr <- df %>%
  select(cols) %>%
  transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
  select(c('SD')) %>%
  as.data.frame()
corr <- bind_cols(corr, select(df, c('Official.Time')))
ggplot(corr, aes(x=Official.Time, y=SD)) +
  #stat_density_2d(geom = "raster", aes(fill = ..density..), contour = FALSE)
  #stat_density_2d(aes(fill = ..level..), geom = "polygon")
  #stat_density_2d(geom = "point", aes(size = ..density..), n = 20, contour = FALSE)
  geom_hex(binwidth = c(1, 20)) + 
  theme_bw()

if (!file.exists('animation.gif')) {
  library(animation)
  n_samples <- 20
  sample <- df %>%
    group_by(group) %>%
    sample_n(n_samples, replace=TRUE)
  
  times <- t(data.matrix(select(sample, cols)))
  
  makeplot <- function() {
    for(i in 1:nrow(sample)) {
    
      plot.ts(times[cols,1:i], 
              plot.type="single", 
              lwd=0.5, 
              col=rep(rainbow(n_groups), each=n_samples), 
              ylim=c(900, 3000), 
              xlab='', ylab='', axes = F)
      
      lines(times[cols,i], 
            lwd=2, col=1, 
            xlab='', ylab='', axes = F)
      
      title(main="5K pace analysis", sub=paste('Group rank #', as.character(ceiling(i/n_samples))), xlab="", ylab="Split time (seconds)")
      axis(side=2,at=c(800, 1000, 1500, 2000, 2500, 3000),labels=c('800', '1000', '1500', '2000', '2500', '3000'))
      axis(side=1,at=c(-10,1,2,3,4,5,6,7,8),labels=c('','5K', '10K', '15K', '20K', '25K', '30K', '35K', '40K'))
    }
  }
  oopt = ani.options(interval = 0, nmax = n_runners)
  saveGIF(makeplot(),interval = 0.1, width = 580, height = 400)
  ani.options(oopt)
}
---
title: "2017 Boston Marathon Analysis"
author: "Eduardo Faccin Vernier"
date: "December 2017"
output: html_notebook
---

# Introduction

Legend says that after the military victory against the Persians at the Battle of Marathon (490 BC), Greek soldier Pheidippides ran the 42.195 kilometres that separate Marathon and Athens to report the victory. What people seem to forget about this fable is that Pheidippides died *immediately* after delivering the message. And so long-distance running as we know was born...


This work analyses the 2017 Boston Marathon that took place on April 17th. Using data collected from 26,400 athletes, we will try to identify relationships between gender, age, ability, finishing times and consistency. We are going to use visualizations and statistics to support our claims.

![](coursemap.jpg)

# Dataset

The dataset used for this study is available at https://www.kaggle.com/rojour/boston-results .

It consists of 26400 observations (athletes), each of which containing name, age, gender, country, city and state (where available), times at 9 different stages of the race, finish time and pace, overall place, gender place and division place.

Given the focus of our analysis, we'll keep only information about the times, sex and gender of each athlete.
The dataset is ordered by finishing time. Below can have a glimpse of the data of the first few runners to cross the line.

```{r results='hide', message=FALSE, warning=FALSE}
# Import libraries
library(dplyr);
library(magrittr);
library(ggplot2);
library(lubridate);
library(readr);
```

```{r}
df <- read.csv("./marathon_results_2017.csv", header=TRUE, stringsAsFactors=FALSE)
# Select only column
df <- df[c('Age', 'M.F', 'X5K', 'X10K', 'X15K', 'X20K', 'X25K', 'X30K', 'X35K', 'X40K', 'Official.Time')]

# Filter runners that had technical problems with recording apparatus
df %<>% filter(X5K != '-' & X10K != '-' & X15K != '-' & X20K != '-' & X25K != '-' & X30K != '-' & X35K != '-' & X40K != '-')
head(df)
print(paste("Number of runners: ", nrow(df)))
print(paste("Number faulty observations: ", 26400 - nrow(df)))
```

As we can see, there are 8 columns that show the surpassed time at each 5K mark. Later we'll do some pace variance analysis, so we'll convert these times to be the time passed **not** from the start of the race, but from the mark of the previous 5K split.

This means that, for example, the X15K column will not tell how long it took the runner to run 15 kilometres, but how long it took the runner to run the last 5 kilomentres preceding the 15K mark, i.e., from 10K to 15K. 


```{r results='hide', message=FALSE, warning=FALSE}
cols <- c('X5K', 'X10K', 'X15K', 'X20K', 'X25K', 'X30K', 'X35K', 'X40K')
df %<>% mutate_each_(funs(as.POSIXct(., format="%H:%M:%S")), cols);

df$X40K <- as.numeric(difftime(df$X40K, df$X35K, units='secs'))
df$X35K <- as.numeric(difftime(df$X35K, df$X30K, units='secs'))
df$X30K <- as.numeric(difftime(df$X30K, df$X25K, units='secs'))
df$X25K <- as.numeric(difftime(df$X25K, df$X20K, units='secs'))
df$X20K <- as.numeric(difftime(df$X20K, df$X15K, units='secs'))
df$X15K <- as.numeric(difftime(df$X15K, df$X10K, units='secs'))
df$X10K <- as.numeric(difftime(df$X10K, df$X5K, units='secs'))
df$X5K <- as.numeric(difftime(df$X5K, as.POSIXct('00:00:00', format="%H:%M:%S"), units='secs'))
colnames(df)[colnames(df) == 'M.F'] <- 'Gender'
head(df)
```

Now we can read the data frame above as: it took 925 seconds for runner #1 to run the first 5K of the race, 903s to run the next 5K, and so on. 

But before we move to the analysis of the results, let's take a look at the demographics of the population.

Inside a marathon there are many simultaneous events. There are men and women's division, as well as 10 official divisons by age intervals. The main age group is the $[18, 39]$ year old, after that there are 9 age groups, but for sake of conciseness, we'll divide athletes into 2 groups: young (belonging to the $[18, 39]$ age bracket), and 40+, for the remaining.


```{r}
demo <- df %>%
  mutate(Gender, Gender = ifelse('M' == Gender,'MEN', 'WOMEN')) %>% 
  mutate(Age, Age = ifelse(Age >= 40, '40+', 'YOUNG')) %>% 
  group_by(Gender, Age) %>% 
  count()

demo$comb <- paste(demo$Age, demo$Gender)

# Pie Chart with Percentages
slices <- demo$n
lbls <- demo$comb
pct <- round(slices/sum(slices)*100)
lbls <- paste(lbls, pct) # add percents to labels 
lbls <- paste(lbls,"%",sep="") # ad % to labels 
pie(slices, labels = lbls, col=c("blue", "cyan", "violet", "pink"), main="Distribution of gender and age")
```

Given the pie chart above, we can see that our population composed of 55% men and 45% women.

As for age, 44% of people are in the "young" range, whilst the remaining are 40 years or older.

The largest group is 40+ men and the smallest is young men, which is interesting.

# How gender and age influence finish-times

The graph below shows compares finishing times between men and women. Because the number of male and female runners differ, the histograms have been normalized.

The plot tells us that men tend to be faster than women. This is not unexpected, as there are physiological reasons why men are capable of faster finish-times than women.


```{r}
df$Official.Time <- as.POSIXct(df$Official.Time, format="%H:%M:%S")
ggplot(df, aes(df$Official.Time, fill = df$Gender)) +
  geom_histogram(aes(y=..density..), alpha=0.5, 
                 position="identity", lwd=0.2) +
  ggtitle("Normalized finish-times for men vs. women")
```


```{r}
df %>%
  mutate(Age, Age = ifelse(Age > 40, '40+', 'YOUNG')) %>% 
  ggplot(aes(Official.Time, fill = Age)) +
    geom_histogram(aes(y=..density..), alpha=0.5, 
                 position="identity", lwd=0.2) +
  ggtitle("Normalized finish-times for young vs. 40+")

```



```{r}
n_groups <- 20
zebra_colormap <- rep(c("darkcyan", "cyan"), 20)

df <- df[df$Official.Time < quantile(df$Official.Time, 0.99), ]


#splits = split(df, cut(df$Official.Time, N))  # Time splits
#splits <- split(df, rep(1:ceiling(nrow(df)/N), each=N, length.out=nrow(df)))  # N marathoners splits
#df$group <- rep(1:ceiling(nrow(df)/n_groups), each=nrow(df)/n_groups, length.out=nrow(df))  # N marathoners splits
df$Official.Time <- as.numeric(difftime(df$Official.Time, as.POSIXct('00:00:00', format="%H:%M:%S"), units='mins'))
df$group <- cut(df$Official.Time, n_groups)

ggplot(df) + 
  geom_point(aes(x=1:NROW(df), y=df$Official.Time,  col=as.factor(df$group))) +
  scale_color_manual(values=zebra_colormap) +
  theme_bw()
```



Are women more disciplined than men?
```{r}
women <- df %>%
  filter(Gender == 'F')

men <- df %>%
  filter(Gender == 'M')

b_splits = split(df, df$group)  # Time splits
w_splits = split(women, women$group)  # Time splits
m_splits = split(men, men$group)  # Time splits


g_sd_df <- data.frame("group" = numeric(0),
                      "gender" = character(0),
                      "n" = numeric(0),
                      "mean_sd" = numeric(0),
                      "sd_sd" = numeric(0),
                      stringsAsFactors = FALSE)

for (i in 1:n_groups) {
  gender <- 'M'
  mean_sd <- as.numeric(m_splits[[i]] %>%
                    select(cols) %>%
                    transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
                    summarize(sample_sd = mean(SD, na.rm = TRUE), sd(SD, na.rm = TRUE)))
  n <- as.numeric(m_splits[[i]] %>%
                    select(Official.Time) %>%
                    summarize(n = n()))
  g_sd_df[nrow(g_sd_df) + 1,] = c(i, gender, n, mean_sd, sd_sd)
  
  
  gender <- 'F'
  mean_sd <- as.numeric(w_splits[[i]] %>%
                    select(cols) %>%
                    transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
                    summarize(sample_sd = mean(SD, na.rm = TRUE), sd(SD, na.rm = TRUE)))
  n <- as.numeric(w_splits[[i]] %>%
                    select(Official.Time) %>%
                    summarize(n = n()))
  if (n != 0) {
    g_sd_df[nrow(g_sd_df) + 1,] = c(i, gender, n, mean_sd, sd_sd)
  }

  gender <- 'B'
  mean_sd <- as.numeric(b_splits[[i]] %>%
                    select(cols) %>%
                    transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
                    summarize(sample_sd = mean(SD, na.rm = TRUE)))
  n <- as.numeric(b_splits[[i]] %>%
                    select(Official.Time) %>%
                    summarize(n = n()))
  g_sd_df[nrow(g_sd_df) + 1,] = c(i, gender, n, mean_sd, sd_sd)
  
}


g_sd_df$n <- as.numeric(g_sd_df$n)
g_sd_df$mean_sd <- as.numeric(g_sd_df$mean_sd)
g_sd_df$sd_sd <- as.numeric(g_sd_df$sd_sd)
g_sd_df$group <- as.numeric(g_sd_df$group)
g_sd_df
```


```{r}
ggplot(g_sd_df) +
  geom_point(aes(x=group, y=mean_sd, colour=gender)) +
  scale_color_manual(values=c('black', 'violet', 'blue')) +
  theme_bw()
```


https://www.youtube.com/watch?v=d7nLL6cUC0I
```{r}
g_sd_df <- filter(g_sd_df, g_sd_df$gender == 'M' | g_sd_df$gender == 'F')
g_sd_df$min_error <- ((g_sd_df$n-1) * (g_sd_df$mean_sd * g_sd_df$mean_sd)) / qchisq(.025, df = g_sd_df$n - 1, lower.tail = FALSE)
g_sd_df$max_error <- ((g_sd_df$n-1) * (g_sd_df$mean_sd * g_sd_df$mean_sd)) / qchisq(.975, df = g_sd_df$n - 1, lower.tail = FALSE)
g_sd_df

ggplot(g_sd_df, aes(x=group, y=mean_sd, colour=gender)) +
  geom_errorbar(aes(ymin=min_error, ymax=max_error)) +
  scale_color_manual(values=c('violet', 'blue')) +
  theme_bw()
```


```{r}
corr <- df %>%
  select(cols) %>%
  transform(SD=apply(., 1, sd, na.rm = TRUE)) %>%
  select(c('SD')) %>%
  as.data.frame()

corr <- bind_cols(corr, select(df, c('Official.Time')))

ggplot(corr, aes(x=Official.Time, y=SD)) +
  #stat_density_2d(geom = "raster", aes(fill = ..density..), contour = FALSE)
  #stat_density_2d(aes(fill = ..level..), geom = "polygon")
  #stat_density_2d(geom = "point", aes(size = ..density..), n = 20, contour = FALSE)
  geom_hex(binwidth = c(1, 20)) + 
  theme_bw()
```


```{r}
if (!file.exists('animation.gif')) {
  library(animation)

  n_samples <- 20
  sample <- df %>%
    group_by(group) %>%
    sample_n(n_samples, replace=TRUE)
  
  times <- t(data.matrix(select(sample, cols)))
  
  makeplot <- function() {
    for(i in 1:nrow(sample)) {
    
      plot.ts(times[cols,1:i], 
              plot.type="single", 
              lwd=0.5, 
              col=rep(rainbow(n_groups), each=n_samples), 
              ylim=c(900, 3000), 
              xlab='', ylab='', axes = F)
      
      lines(times[cols,i], 
            lwd=2, col=1, 
            xlab='', ylab='', axes = F)
      
      title(main="5K pace analysis", sub=paste('Group rank #', as.character(ceiling(i/n_samples))), xlab="", ylab="Split time (seconds)")
      axis(side=2,at=c(800, 1000, 1500, 2000, 2500, 3000),labels=c('800', '1000', '1500', '2000', '2500', '3000'))
      axis(side=1,at=c(-10,1,2,3,4,5,6,7,8),labels=c('','5K', '10K', '15K', '20K', '25K', '30K', '35K', '40K'))
    }
  }
  oopt = ani.options(interval = 0, nmax = n_runners)
  saveGIF(makeplot(),interval = 0.1, width = 580, height = 400)
  ani.options(oopt)
}
```


![](animation.gif)

 





